# Homework_2_Group_13

---
categories:
- ""
- ""
date: "2017-10-31T21:28:43-05:00"
description: "About Me"
draft: false
keywords: ""
slug: project1
title: "Life & HIV"
author: "Study Group 13"
date: "Sep 20, 2020"
output:
  html_document:
    theme: flatly
    highlight: zenburn
    number_sections: yes
    toc: yes
    toc_float: yes
    code_folding: show
---

## Climate change and temperature anomalies

To study climate change, we can use data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere from NASA's Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here

Temperature anomalies are defined relative to a reference, or base, period, which NASA states is 1951-1980.

Run the code below to load the file:

weather <- read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv", 
           skip = 1, #skip one row as real data starts from row 2
           na = "***") #missing data is coded as *** so we specify that here
weather
## # A tibble: 140 x 19
##     Year   Jan    Feb    Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
##    <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1880 -0.54 -0.38  -0.26  -0.37 -0.11 -0.22 -0.23 -0.24 -0.26 -0.32 -0.37
##  2  1881 -0.19 -0.25   0.02  -0.02 -0.06 -0.36 -0.06 -0.03 -0.23 -0.4  -0.42
##  3  1882  0.22  0.22   0     -0.36 -0.32 -0.38 -0.37 -0.14 -0.17 -0.53 -0.32
##  4  1883 -0.59 -0.67  -0.16  -0.27 -0.32 -0.26 -0.09 -0.26 -0.33 -0.21 -0.4 
##  5  1884 -0.23 -0.11  -0.65  -0.62 -0.42 -0.52 -0.48 -0.5  -0.45 -0.41 -0.48
##  6  1885 -1    -0.37  -0.21  -0.53 -0.55 -0.47 -0.39 -0.44 -0.32 -0.3  -0.28
##  7  1886 -0.68 -0.68  -0.570 -0.34 -0.34 -0.43 -0.2  -0.47 -0.34 -0.31 -0.45
##  8  1887 -1.07 -0.580 -0.36  -0.42 -0.27 -0.2  -0.23 -0.52 -0.17 -0.4  -0.19
##  9  1888 -0.53 -0.59  -0.580 -0.24 -0.16 -0.04  0.04 -0.19 -0.12  0.04 -0.03
## 10  1889 -0.31  0.35   0.07   0.15 -0.05 -0.12 -0.1  -0.16 -0.26 -0.34 -0.61
## # … with 130 more rows, and 7 more variables: Dec <dbl>, `J-D` <dbl>,
## #   `D-N` <dbl>, DJF <dbl>, MAM <dbl>, JJA <dbl>, SON <dbl>

Next, we will clean the dataset: discard unwanted columns and convert the data from wide to long format.

# dropping columns we don't need
weather2 <- weather %>% select(-`J-D`, -`D-N`, -DJF, -MAM, -JJA, -SON)
weather2
## # A tibble: 140 x 13
##     Year   Jan    Feb    Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
##    <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1880 -0.54 -0.38  -0.26  -0.37 -0.11 -0.22 -0.23 -0.24 -0.26 -0.32 -0.37
##  2  1881 -0.19 -0.25   0.02  -0.02 -0.06 -0.36 -0.06 -0.03 -0.23 -0.4  -0.42
##  3  1882  0.22  0.22   0     -0.36 -0.32 -0.38 -0.37 -0.14 -0.17 -0.53 -0.32
##  4  1883 -0.59 -0.67  -0.16  -0.27 -0.32 -0.26 -0.09 -0.26 -0.33 -0.21 -0.4 
##  5  1884 -0.23 -0.11  -0.65  -0.62 -0.42 -0.52 -0.48 -0.5  -0.45 -0.41 -0.48
##  6  1885 -1    -0.37  -0.21  -0.53 -0.55 -0.47 -0.39 -0.44 -0.32 -0.3  -0.28
##  7  1886 -0.68 -0.68  -0.570 -0.34 -0.34 -0.43 -0.2  -0.47 -0.34 -0.31 -0.45
##  8  1887 -1.07 -0.580 -0.36  -0.42 -0.27 -0.2  -0.23 -0.52 -0.17 -0.4  -0.19
##  9  1888 -0.53 -0.59  -0.580 -0.24 -0.16 -0.04  0.04 -0.19 -0.12  0.04 -0.03
## 10  1889 -0.31  0.35   0.07   0.15 -0.05 -0.12 -0.1  -0.16 -0.26 -0.34 -0.61
## # … with 130 more rows, and 1 more variable: Dec <dbl>
# converting dataframe from wide to long format using pivot_longer( ) 
tidyweather <- weather2 %>% pivot_longer(cols=2:13, names_to="month", values_to="delta")
tidyweather
## # A tibble: 1,680 x 3
##     Year month delta
##    <dbl> <chr> <dbl>
##  1  1880 Jan   -0.54
##  2  1880 Feb   -0.38
##  3  1880 Mar   -0.26
##  4  1880 Apr   -0.37
##  5  1880 May   -0.11
##  6  1880 Jun   -0.22
##  7  1880 Jul   -0.23
##  8  1880 Aug   -0.24
##  9  1880 Sep   -0.26
## 10  1880 Oct   -0.32
## # … with 1,670 more rows
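As a side note, the same wide-to-long reshaping can be done in base R with `reshape()`; a minimal sketch on a toy two-month table (the toy values mirror the first rows of `weather`, but the object names are ours):

```r
# toy wide table: one row per year, one column per month
wide <- data.frame(Year = c(1880, 1881),
                   Jan  = c(-0.54, -0.19),
                   Feb  = c(-0.38, -0.25))

# base-R analogue of pivot_longer(cols = 2:3, names_to = "month", values_to = "delta")
long <- reshape(wide, direction = "long",
                varying = c("Jan", "Feb"), v.names = "delta",
                times = c("Jan", "Feb"), timevar = "month",
                idvar = "Year")
long <- long[order(long$Year), c("Year", "month", "delta")]
```

`pivot_longer()` is the more readable option, which is why we use it above.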

Inspecting our dataframe, we now have three variables: one each for year, month and delta (temperature deviation).

## Plotting Information

We will plot the data as a time-series scatter plot and add a trendline. But first we use `lubridate` to build proper dates, ensuring that delta is plotted chronologically.

# reformatting dates
tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), month, "1")), # lubridate used here
         month = month(date, label=TRUE),
         Year = year(date)) 

# plotting temperature deviation 
ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point(alpha=0.5) +
  geom_smooth(color="red") +
  theme_bw() +
  theme(
    plot.title=element_text(face="bold")
  ) +
  labs (
    title = "Weather Anomalies",
    subtitle = "Trends in temperature deviations over the period 1880 - 2019",
    x = "Year",
    y = "Temperature Deviation"
  )
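For reference, the `ymd(paste(...))` step above simply builds a proper date from the year and the month name; a base-R sketch of the same idea (assuming English month abbreviations, as in the NASA file):

```r
years  <- c(1880, 1880)
months <- c("Jan", "Feb")   # month abbreviations as they appear in the data

# map each abbreviation to its month number, then build an ISO date string
month_num <- match(months, month.abb)
dates <- as.Date(sprintf("%d-%02d-01", years, month_num))
```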

Next, we study how temperature deviations have changed by month.

# plotting scatterplot and trend lines by month
ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point(size=0.5, alpha=0.6)+
  geom_smooth(color="red") +
  facet_wrap(~month) + # to separate plots by month
  theme_bw() +
  theme(
    plot.title=element_text(face="bold")
  ) +
  labs (
    title = "It's Getting Hot in Here",
    subtitle="Trends in temperature deviations BY MONTH between 1880 and 2019",
    y="Temperature Deviation",
    x="Year"
  )

> Overall, we see a gradual rise in the average temperature across all twelve months from 1880 to 2019, owing to global warming. This rise has been greater for winter months such as January, February, November and December than for summer months such as May, June, July and August.
>
> Another point worth mentioning is that the winter months show greater spread (the points are more scattered around the trendline) than the summer months, where the points lie closer to the trendline.

Since it is sometimes useful to group data into different time periods to study historical data, we create a new data frame called comparison that groups data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

comparison
## # A tibble: 1,668 x 5
##     Year month delta date       interval 
##    <dbl> <ord> <dbl> <date>     <chr>    
##  1  1881 Jan   -0.19 1881-01-01 1881-1920
##  2  1881 Feb   -0.25 1881-02-01 1881-1920
##  3  1881 Mar    0.02 1881-03-01 1881-1920
##  4  1881 Apr   -0.02 1881-04-01 1881-1920
##  5  1881 May   -0.06 1881-05-01 1881-1920
##  6  1881 Jun   -0.36 1881-06-01 1881-1920
##  7  1881 Jul   -0.06 1881-07-01 1881-1920
##  8  1881 Aug   -0.03 1881-08-01 1881-1920
##  9  1881 Sep   -0.23 1881-09-01 1881-1920
## 10  1881 Oct   -0.4  1881-10-01 1881-1920
## # … with 1,658 more rows
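The `case_when()` binning can also be written with base R's `findInterval()`, which returns the index of the interval each year falls into; a small sketch with the same break points:

```r
years  <- c(1881, 1950, 1981, 2019)
breaks <- c(1881, 1921, 1951, 1981, 2011)   # left edge of each period
labels <- c("1881-1920", "1921-1950", "1951-1980", "1981-2010", "2011-present")

# findInterval() maps each year to the bin containing it
interval <- labels[findInterval(years, breaks)]
```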

Now that we have intervals, we can study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in.

ggplot(comparison, aes(x=delta, fill=interval)) + #fill to group and colour by different time intervals
  geom_density(alpha=0.2) +   #density plot with transparency set to 20%
  theme_bw( ) +               # theme white background and grey lines
  theme(
    plot.title=element_text(face="bold") # make title bold
  ) +
  labs (
    title="Distribution of Monthly Deviations",
    subtitle = "Density plot grouped by different time intervals",
    y = "",
    x = "Temperature Deviation"
    ) +
  guides(fill=guide_legend(title="Time Interval")) #change legend title

So far, we have been working with monthly data for temperature deviations. Let us study average annual anomalies next.

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%   #grouping data by Year
  
  # creating summaries for mean delta 
  summarise(annual_average_delta = mean(delta, na.rm=TRUE)) # use `na.rm=TRUE` to eliminate NA (not available) values

#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
  geom_point()+
  geom_smooth(method="loess", color="red") +  # fit the best fit line, using LOESS method
  theme_bw() +
  theme(
    plot.title = element_text(face="bold")
  ) +
  labs (
    title = "Average Yearly Anomaly",
    y     = "Average Annual Temperature Deviations"
  )                         

## Confidence Interval for delta

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

We will construct a confidence interval for the average annual `delta` since 2011, using two different methods:

  1. a formula
formula_CI <- comparison %>% 
                filter(interval=="2011-present") %>% # filter for time period we're interested in
                summarise(mean=mean(delta, na.rm=TRUE), # calculate summary statistics for delta
                          SD=sd(delta, na.rm=TRUE), 
                          count=n(), 
                          SE=SD/sqrt(count),
                          lower_CI = mean - 1.96*SE,
                          upper_CI = mean + 1.96*SE
                          )

formula_CI
## # A tibble: 1 x 6
##    mean    SD count     SE lower_CI upper_CI
##   <dbl> <dbl> <int>  <dbl>    <dbl>    <dbl>
## 1 0.966 0.262   108 0.0252    0.916     1.02
  2. bootstrap simulation with the `infer` package
# use the infer package to construct a 95% CI for delta
library(infer)

set.seed(1234)
whatever_id_like <- comparison %>% 
  filter(interval=="2011-present") %>%      # filtering for time period we're interested in
  specify(response=delta) %>%               # specifying what we're calculating CI for
  generate(reps=1000, type="bootstrap") %>% # generate random samples or reps using bootstrap
  calculate(stat="mean")                    # calculate mean
  
percentile_CI <- whatever_id_like %>% 
                 get_confidence_interval(level=0.95, type="percentile")
percentile_CI
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.917     1.02

The first method calculates summary statistics and a confidence interval (CI) using all the temperature-deviation (delta) data from 2011 to the present. The second, bootstrap method draws 1,000 random resamples (reps) and calculates their means, from which we derive the CI.

The 95% CI runs from 0.917 to 1.02. This means that if we repeatedly drew samples and constructed intervals in this way, we would expect about 950 out of every 1,000 such intervals to contain the true population mean.
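This interpretation can be sanity-checked with a short simulation in base R: draw many samples from a known population, build a formula-based 95% CI for each, and count how often the interval covers the true mean (the normal population below is purely illustrative):

```r
set.seed(42)
true_mean <- 1

covered <- replicate(2000, {
  x  <- rnorm(100, mean = true_mean, sd = 0.25)   # one sample of size 100
  se <- sd(x) / sqrt(length(x))
  ci <- mean(x) + c(-1.96, 1.96) * se             # formula-based 95% CI
  ci[1] <= true_mean && true_mean <= ci[2]        # does it contain the truth?
})

mean(covered)   # typically close to 0.95
```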

## General Social Survey (GSS)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends in attitudes, behaviours, and attributes. Many trends have been tracked for decades, so one can see the evolution of attitudes, etc., in American society.

Here, we analyze data from the 2016 GSS sample data.

gss <- read_csv(here::here("data", "smallgss2016.csv"), 
                na = c("", "Don't know",
                       "No answer", "Not applicable")) # this is how NA data is in the .csv file
gss
## # A tibble: 2,867 x 7
##    emailmin emailhr snapchat instagrm twitter sex    degree        
##    <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <chr>         
##  1 0        12      NA       NA       NA      Male   Bachelor      
##  2 30       0       No       No       No      Male   High school   
##  3 NA       NA      No       No       No      Male   Bachelor      
##  4 10       0       NA       NA       NA      Female High school   
##  5 NA       NA      Yes      Yes      No      Female Graduate      
##  6 0        2       No       Yes      No      Female Junior college
##  7 0        40      NA       NA       NA      Male   High school   
##  8 NA       NA      Yes      Yes      No      Female High school   
##  9 0        0       NA       NA       NA      Male   High school   
## 10 NA       NA      No       No       No      Male   Junior college
## # … with 2,857 more rows

## Instagram and Snapchat, by sex

First, we will study the Instagram and Snapchat usage habits of the survey respondents.

# creating a new variable 'snap_insta'
gss_2 <- gss %>% 
          mutate(
            snap_insta = case_when(
              (snapchat=="Yes" | instagrm=="Yes") ~ "Yes",
              (is.na(snapchat) & is.na(instagrm)) ~ NA_character_, # a real NA never equals the string "NA"
              TRUE ~ "No"
              )
            )
gss_2 
## # A tibble: 2,867 x 8
##    emailmin emailhr snapchat instagrm twitter sex    degree         snap_insta
##    <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <chr>          <chr>     
##  1 0        12      NA       NA       NA      Male   Bachelor       NA        
##  2 30       0       No       No       No      Male   High school    No        
##  3 NA       NA      No       No       No      Male   Bachelor       No        
##  4 10       0       NA       NA       NA      Female High school    NA        
##  5 NA       NA      Yes      Yes      No      Female Graduate       Yes       
##  6 0        2       No       Yes      No      Female Junior college Yes       
##  7 0        40      NA       NA       NA      Male   High school    NA        
##  8 NA       NA      Yes      Yes      No      Female High school    Yes       
##  9 0        0       NA       NA       NA      Male   High school    NA        
## 10 NA       NA      No       No       No      Male   Junior college No        
## # … with 2,857 more rows
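A caveat behind this recoding: a genuine missing value never matches the string "NA", so a comparison like `snapchat == "NA"` returns `NA` rather than `TRUE`; `is.na()` is the reliable test. A quick base-R illustration:

```r
x <- c("Yes", "No", NA)

x == "NA"        # NA for the missing element -- never TRUE
is.na(x)         # FALSE FALSE TRUE: the correct missingness test

# a string comparison therefore never finds the missing rows
sum(x == "NA", na.rm = TRUE)   # 0
```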

To calculate the proportion of "Yes" responses for the new variable `snap_insta`, among those who answered the question, we filter the data as below:

gss_3 <- gss_2 %>% 
        filter(snap_insta=="Yes" | snap_insta=="No") %>% # filtering out NAs
        summarise(proportion_of_yes = sum(snap_insta=="Yes")/n())
gss_3 
## # A tibble: 1 x 1
##   proportion_of_yes
##               <dbl>
## 1             0.375

Next, using the CI formula for proportions, we will construct 95% CIs for men and women who use either Snapchat or Instagram.

# filtering data for men who used either SC or IG
gss_male <- gss_2 %>% 
  filter(snap_insta == "Yes" | snap_insta=="No") %>%
  filter(sex == "Male") %>%
   summarise(prop_yes_males = sum(snap_insta=="Yes")/n()
             )

gss_male 
## # A tibble: 1 x 1
##   prop_yes_males
##            <dbl>
## 1          0.318
# filtering data for women who used either SC or IG
gss_female <- gss_2 %>% 
  filter(snap_insta == "Yes" | snap_insta=="No") %>%
  filter(sex == "Female") %>%
   summarise(prop_yes_females = sum(snap_insta=="Yes")/n()
             )

gss_female
## # A tibble: 1 x 1
##   prop_yes_females
##              <dbl>
## 1            0.419
# The number of men for either SC or IG is calculated.
n_male <- gss_2 %>% 
  filter(snap_insta == "Yes" | snap_insta=="No") %>%
  filter(sex == "Male") %>% 
  nrow()
n_male
## [1] 603
# The number of women for either IG or SC is calculated.
n_female <- gss_2 %>% 
  filter(snap_insta == "Yes" | snap_insta=="No") %>%
  filter(sex == "Female") %>% 
  nrow()
n_female
## [1] 769
# The standard errors for both men and women are found.
p_male <- gss_male$prop_yes_males
p_female <- gss_female$prop_yes_females
se_scinsta_male <- sqrt(p_male*(1-p_male)/n_male)
se_scinsta_female <- sqrt(p_female*(1-p_female)/n_female)

# The CIs are centred on the sample proportions, +/- 1.96*SE
lower_ci_male <- p_male - (1.96*se_scinsta_male)
upper_ci_male <- p_male + (1.96*se_scinsta_male)
lower_ci_female <- p_female - (1.96*se_scinsta_female)
upper_ci_female <- p_female + (1.96*se_scinsta_female)

paste("95% confidence interval for men who use either Snapchat or Instagram is: ", round(lower_ci_male, 3), ",", round(upper_ci_male,3), " whereas that for women who use either Snapchat or Instagram: ", round(lower_ci_female, 3), ",", round(upper_ci_female, 3) ) 
## [1] "95% confidence interval for men who use either Snapchat or Instagram is:  0.281 , 0.356  whereas that for women who use either Snapchat or Instagram:  0.384 , 0.454"
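The proportion-CI formula used here generalises to a small helper; a base-R sketch (`prop_ci` is our own name, not a package function), evaluated at the rounded proportions and sample sizes from above:

```r
# 95% CI for a sample proportion p_hat from n respondents: p_hat +/- z * SE
prop_ci <- function(p_hat, n, z = 1.96) {
  se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of a proportion
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

round(prop_ci(0.318, 603), 3)   # men:   roughly 0.281 to 0.355
round(prop_ci(0.419, 769), 3)   # women: roughly 0.384 to 0.454
```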

## Twitter, by education level

First, we will convert degree from a character variable into a factor variable, making sure the order is correct: Lt high school, High school, Junior college, Bachelor and Graduate.

Second, for bachelors & graduates we’ll calculate the proportion of people who do and don’t use Twitter.

Finally, using the CI formula for proportions, we’ll construct two 95% CIs for bachelors & graduates, depending on whether they use Twitter or not.

# converting degree to factor variable
twitter_cleaned <- gss %>% 
  mutate(
    degree = factor(degree, levels = c("Lt high school", "High school", "Junior college", "Bachelor", "Graduate"), 
                    ordered = TRUE)
        ) %>% 
  mutate(                                   # categorising people on basis of whether they're bachelors/graduates
    bachelor_graduate = case_when(
      degree %in% c("Bachelor", "Graduate") ~ "Yes",
      is.na(degree) ~ NA_character_,        # a real NA never equals the string "NA"
      TRUE ~ "No"
          )
        ) 
twitter_cleaned
## # A tibble: 2,867 x 8
##    emailmin emailhr snapchat instagrm twitter sex    degree     bachelor_gradua…
##    <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <ord>      <chr>           
##  1 0        12      NA       NA       NA      Male   Bachelor   Yes             
##  2 30       0       No       No       No      Male   High scho… No              
##  3 NA       NA      No       No       No      Male   Bachelor   Yes             
##  4 10       0       NA       NA       NA      Female High scho… No              
##  5 NA       NA      Yes      Yes      No      Female Graduate   Yes             
##  6 0        2       No       Yes      No      Female Junior co… No              
##  7 0        40      NA       NA       NA      Male   High scho… No              
##  8 NA       NA      Yes      Yes      No      Female High scho… No              
##  9 0        0       NA       NA       NA      Male   High scho… No              
## 10 NA       NA      No       No       No      Male   Junior co… No              
## # … with 2,857 more rows
# filtering bachelors & graduates to study their twitter usage
twitter_stats <- twitter_cleaned %>% 
  filter(bachelor_graduate=="Yes") %>% 
  summarise(
      prop_twitter_yes = sum(twitter=="Yes", na.rm=TRUE)/sum(twitter=="Yes" | twitter=="No", na.rm=TRUE), 
      prop_twitter_no = sum(twitter=="No", na.rm=TRUE)/sum(twitter=="Yes" | twitter=="No", na.rm=TRUE)
            ) 

twitter_stats
## # A tibble: 1 x 2
##   prop_twitter_yes prop_twitter_no
##              <dbl>           <dbl>
## 1            0.233           0.767
# assigning values for calculating CIs
p_hat <- twitter_stats$prop_twitter_yes
size <- twitter_cleaned %>% 
  filter(bachelor_graduate == "Yes", !is.na(twitter)) %>%   # only bachelors/graduates who answered
  nrow()

# calculating std error; the CI is centred on the sample proportion p_hat
se_twitter <- sqrt(p_hat*(1-p_hat)/size)

# calculating lower and upper confidence interval  
twitter_lower_ci <- p_hat - (1.96*se_twitter) 
twitter_upper_ci <- p_hat + (1.96*se_twitter)

paste("95% confidence interval for Bachelors and Graduates who use Twitter is: ", round(twitter_lower_ci, 4), ",", round(twitter_upper_ci,4))

## Email usage

To study the time survey respondents spend reading emails weekly, we will visualize a new variable that combines `emailhr` and `emailmin`.

We’ll also find the mean and the median number of minutes to decide which is a better measure of the typical amount of time Americans spend on email weekly.

# convert email mins and hrs into numeric variables
gss_email <- gss_2 %>% 
    filter(!is.na(emailhr)) %>%   # drop respondents who did not answer
    mutate(
      emailhr = as.numeric(emailhr), 
      emailmin = as.numeric(emailmin), 
      email_hours = (emailhr * 60),   # hours expressed in minutes
      email = email_hours + emailmin  # total weekly minutes
        )

gss_email
## # A tibble: 1,649 x 10
##    emailmin emailhr snapchat instagrm twitter sex   degree snap_insta
##       <dbl>   <dbl> <chr>    <chr>    <chr>   <chr> <chr>  <chr>     
##  1        0      12 NA       NA       NA      Male  Bache… NA        
##  2       30       0 No       No       No      Male  High … No        
##  3       10       0 NA       NA       NA      Fema… High … NA        
##  4        0       2 No       Yes      No      Fema… Junio… Yes       
##  5        0      40 NA       NA       NA      Male  High … NA        
##  6        0       0 NA       NA       NA      Male  High … NA        
##  7        0       2 No       No       No      Fema… High … No        
##  8        0      10 No       No       No      Male  High … No        
##  9       30       0 No       No       No      Male  Lt hi… No        
## 10        0       1 NA       NA       NA      Fema… Lt hi… NA        
## # … with 1,639 more rows, and 2 more variables: email_hours <dbl>, email <dbl>
# finding mean and median
gss_summarise <-  gss_email %>% 
  summarise(mean_total = mean(email), 
            median_total = median(email) 
            )

gss_summarise
## # A tibble: 1 x 2
##   mean_total median_total
##        <dbl>        <dbl>
## 1       417.          120
# plotting the amount of mins spent weekly on emails 
email_plot <- gss_email %>% 
  ggplot(mapping = aes(x = email)) +
  geom_boxplot() +
  theme_bw() +
  theme(
    plot.title=element_text(face="bold")
  ) +
  labs(
    title = "Check your Inbox!",
    subtitle = "Time Americans spend weekly reading their emails",
    y = "",
    x = "minutes"
  ) +
  NULL

email_plot

> From the boxplot above, we can see several extreme outliers that pull the mean upwards. The median is therefore a better measure of the amount of time a typical American spends reading their emails weekly.
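The mean's sensitivity to outliers, and the median's robustness to them, is easy to demonstrate with toy numbers in base R (the values below are illustrative, not from the GSS):

```r
minutes <- c(30, 60, 90, 120, 150)   # typical weekly email minutes
mean(minutes)                        # 90
median(minutes)                      # 90

minutes_out <- c(minutes, 4000)      # add one very heavy email user
mean(minutes_out)                    # jumps to about 741.7
median(minutes_out)                  # barely moves: 105
```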

Next, we’ll calculate a 95% bootstrap confidence interval for the mean amount of time Americans spend on email weekly.

library(infer)
set.seed(2468)

# bootstrap
whatever_id_email <- gss_email %>% 
  specify(response = email) %>% 
  generate(reps=1000, type="bootstrap") %>% 
  calculate(stat="mean")


# calculating confidence interval
email_CI <- whatever_id_email %>% 
  get_confidence_interval(level=0.95, type="percentile")

email_CI
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     387.     452.
lower_ci_hours <- email_CI$lower_ci %/% 60   # whole hours, not rounded
lower_ci_mins <- email_CI$lower_ci %% 60
upper_ci_hours <- email_CI$upper_ci %/% 60
upper_ci_mins <- email_CI$upper_ci %% 60

paste("95% confidence interval for email hours is:", lower_ci_hours, "hr", round(lower_ci_mins,0), "minutes,", upper_ci_hours, "hr", round(upper_ci_mins), "minutes")
## [1] "95% confidence interval for email hours is: 6 hr 27 minutes, 7 hr 32 minutes"

If we were to calculate a 99% confidence interval instead, we would expect it to be wider. To be more confident that the true population mean will lie within the confidence interval, we expand the range of the interval itself to allow more potential values to fall within it. Therefore, greater confidence comes at the cost of a wider range.
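The widening comes straight from the normal critical value; a quick base-R check comparing z* for 95% and 99% confidence:

```r
z95 <- qnorm(0.975)   # about 1.96
z99 <- qnorm(0.995)   # about 2.58

# for the same standard error, the 99% interval is roughly 31% wider
z99 / z95
```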

## Trump’s Approval Margins

# Import approval polls data
approval_polllist <- read_csv(here::here('data', 'approval_polllist.csv'))

# Quickly view data set.
glimpse(approval_polllist)
## Rows: 14,533
## Columns: 22
## $ president           <chr> "Donald Trump", "Donald Trump", "Donald Trump", "…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls…
## $ modeldate           <chr> "8/29/2020", "8/29/2020", "8/29/2020", "8/29/2020…
## $ startdate           <chr> "1/20/2017", "1/20/2017", "1/20/2017", "1/21/2017…
## $ enddate             <chr> "1/22/2017", "1/22/2017", "1/24/2017", "1/23/2017…
## $ pollster            <chr> "Gallup", "Morning Consult", "Ipsos", "Gallup", "…
## $ grade               <chr> "B", "B/C", "B-", "B", "B", "C+", "B-", "B+", "B"…
## $ samplesize          <dbl> 1500, 1992, 1632, 1500, 1500, 1500, 1651, 1190, 2…
## $ population          <chr> "a", "rv", "a", "a", "a", "lv", "a", "rv", "a", "…
## $ weight              <dbl> 0.262, 0.680, 0.153, 0.243, 0.227, 0.200, 0.142, …
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ approve             <dbl> 45.0, 46.0, 42.1, 45.0, 46.0, 57.0, 42.3, 36.0, 4…
## $ disapprove          <dbl> 45.0, 37.0, 45.2, 46.0, 45.0, 43.0, 45.8, 44.0, 3…
## $ adjusted_approve    <dbl> 45.8, 45.3, 43.2, 45.8, 46.8, 51.6, 43.4, 37.7, 4…
## $ adjusted_disapprove <dbl> 43.6, 37.8, 43.9, 44.6, 43.6, 44.4, 44.5, 42.8, 3…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tracking            <lgl> TRUE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, NA, NA, T…
## $ url                 <chr> "http://www.gallup.com/poll/201617/gallup-daily-t…
## $ poll_id             <dbl> 49253, 49249, 49426, 49262, 49236, 49266, 49425, …
## $ question_id         <dbl> 77265, 77261, 77599, 77274, 77248, 77278, 77598, …
## $ createddate         <chr> "1/23/2017", "1/23/2017", "3/1/2017", "1/24/2017"…
## $ timestamp           <chr> "13:38:37 29 Aug 2020", "13:38:37 29 Aug 2020", "…
# Data set cleaned.
approval_polllist_clean <- approval_polllist %>%
    select(c(enddate, samplesize, adjusted_approve, adjusted_disapprove)) %>% 
    mutate(approval_difference = adjusted_approve - adjusted_disapprove)
approval_polllist_clean
## # A tibble: 14,533 x 5
##    enddate   samplesize adjusted_approve adjusted_disapprove approval_difference
##    <chr>          <dbl>            <dbl>               <dbl>               <dbl>
##  1 1/22/2017       1500             45.8                43.6               2.18 
##  2 1/22/2017       1992             45.3                37.8               7.44 
##  3 1/24/2017       1632             43.2                43.9              -0.659
##  4 1/23/2017       1500             45.8                44.6               1.18 
##  5 1/24/2017       1500             46.8                43.6               3.18 
##  6 1/24/2017       1500             51.6                44.4               7.15 
##  7 1/25/2017       1651             43.4                44.5              -1.06 
##  8 1/25/2017       1190             37.7                42.8              -5.08 
##  9 1/25/2017       2692             41.9                36.9               4.99 
## 10 1/26/2017       1678             43.7                45.1              -1.36 
## # … with 14,523 more rows
Next, we use `lubridate` to fix the dates, as they are given as characters.

## Create a plot

The average net approval rating of Trump, along with 95% confidence intervals, is calculated below.

# The approval data is aggregated by week, and confidence intervals are computed.
Trump_Approval <- approval_polllist_clean %>%
  mutate(enddate = mdy(enddate),
         year = year(enddate),
         week = week(enddate)
         ) %>%
  group_by(year, week) %>% 
        summarise(Weekly_difference = sum(approval_difference),
                  avg_difference = mean(approval_difference),
                  se_Trump = sd(approval_difference)/sqrt(n()),
                  t_crit = qt(0.975, n() - 1),
                  lower_CIs = (avg_difference - t_crit * se_Trump),
                  upper_CIs = (avg_difference + t_crit * se_Trump)
                  )
Trump_Approval
## # A tibble: 191 x 8
## # Groups:   year [4]
##     year  week Weekly_differen… avg_difference se_Trump t_crit lower_CIs
##    <dbl> <dbl>            <dbl>          <dbl>    <dbl>  <dbl>     <dbl>
##  1  2017     4             64.2           1.39    0.698   2.01   -0.0102
##  2  2017     5           -120.           -1.79    0.517   2.00   -2.82  
##  3  2017     6           -290.           -3.82    0.612   1.99   -5.04  
##  4  2017     7           -541.           -6.44    0.615   1.99   -7.66  
##  5  2017     8           -528.           -6.07    0.519   1.99   -7.10  
##  6  2017     9           -330.           -4.23    0.508   1.99   -5.24  
##  7  2017    10           -348.           -3.91    0.575   1.99   -5.05  
##  8  2017    11           -666.           -7.83    0.599   1.99   -9.02  
##  9  2017    12           -636.           -7.57    0.754   1.99   -9.07  
## 10  2017    13          -1060.          -11.6     0.726   1.99  -13.1   
## # … with 181 more rows, and 1 more variable: upper_CIs <dbl>
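Note that the code uses `qt(0.975, n() - 1)` rather than a fixed 1.96: with only a handful of polls in a week, the t critical value is noticeably larger than the normal one. A base-R check:

```r
# t critical values shrink toward the normal 1.96 as degrees of freedom grow
round(qt(0.975, df = c(4, 9, 29, 99, 999)), 3)

qnorm(0.975)   # the normal benchmark, about 1.96
```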
# The approval plot is created along with the confidence intervals.
Approval_Plot <- Trump_Approval %>%
  ggplot(aes(x = week, y = avg_difference)) +
  geom_ribbon(mapping = aes(ymin = lower_CIs, ymax = upper_CIs), alpha = 0.3)+
  geom_line() +
  geom_point() +
  facet_wrap(~ year) +
  geom_hline(mapping = aes(yintercept = 0), color = "orange") +
  theme(legend.position = "none",
        panel.background = element_rect("white", "black"),
        panel.grid.major = element_line("lightgrey"),
        panel.grid.minor = element_line("lightgrey")
        ) +
  labs (
    title = "Estimated Net Approval (approve-disapprove) for Donald Trump",
    subtitle = "Weekly average of all polls",
    y = "Average Net Approval (%)",
    x = "Week of the year"
  ) +
  expand_limits(x = c(0, 52), y = 7.5) +
  NULL

Approval_Plot

For comparison, this is the ideal picture of the plot.

## Compare Confidence Intervals

As seen in the approval plot, the confidence intervals are narrower in mid-April than in mid-August. Since the standard error grows with the standard deviation of Trump’s approval ratings, national events around mid-August, such as the BLM protests and the handling of the Covid crisis, made the public’s view of Trump far more polarized. The larger standard errors that result widen the confidence intervals.

## Gapminder revisited

In this part, we will join a few dataframes containing data on HIV prevalence among adults, life expectancy, World Bank development indicators (fertility rate, GDP per capita, school enrollment and under-5 mortality), and country regions.

First, we will compile and clean the dataframes.

# load gapminder HIV and life Expectancy data, turn the data frames into one format
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv")) %>% 
    pivot_longer(cols = 2:34, names_to = "date", values_to = "hiv_prv")%>% 
    mutate(date = as.numeric(date),
           hiv_prv = as.numeric(hiv_prv))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv")) %>% 
    pivot_longer(cols = 2:302, names_to = "date", values_to = "life_exp") %>% 
    mutate(date = as.numeric(date),
           life_exp = as.numeric(life_exp))
# get World bank data from local due to the difficulty of connection
worldbank_data <- read_csv(here::here("data","worldbank_data.csv"))

# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels,  from the World Bank API 
countries <-  read_csv(here::here("data","countries.csv"))

#merge the 3 data frames with left_join, keeping every row of the World Bank data
merge_wbd <- worldbank_data %>% 
  left_join(., life_expectancy, by = c("country","date")) %>% 
  left_join(., hiv, by = c("country", "date")) %>% 
  mutate(region = countries$region[match(country, countries$country)]) 

# rename the World Bank indicator codes to descriptive column names
merge_wbd <- merge_wbd %>% 
  rename(fertility_rate = SP.DYN.TFRT.IN,
         GDP_cap        = NY.GDP.PCAP.KD,
         school_enroll  = SE.PRM.NENR,
         mortality_rate = SH.DYN.MORT)
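
The `match()`-based region lookup above is equivalent to a `left_join` against the countries table when each country appears once in the lookup; a toy check with made-up country names:

```r
library(dplyr)

# Hypothetical lookup table and data, mirroring countries/worldbank_data
countries_toy <- tibble(country = c("Ghana", "Peru"),
                        region  = c("Africa", "Americas"))
data_toy      <- tibble(country = c("Peru", "Ghana", "Peru"))

# match()-based lookup, as used in the merge above
via_match <- data_toy %>%
  mutate(region = countries_toy$region[match(country, countries_toy$country)])

# equivalent left_join (assumes country is unique in the lookup table)
via_join <- data_toy %>%
  left_join(countries_toy, by = "country")

all(via_match$region == via_join$region)  # both lookups agree
```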

After merging and cleaning the data, we answer the following questions:

  1. What is the relationship between HIV prevalence and life expectancy?
#plot the relationship between HIV prevalence and life expectancy
ggplot(merge_wbd, aes(x = life_exp, y = hiv_prv)) + 
  geom_point(size = 0.2) + 
  geom_smooth(alpha = 0.5) +
  facet_wrap(~region, scales = "free") +
  theme_bw() +
  labs(title = "Relationship Between Life Expectancy and HIV Prevalence",
       x = "Life Expectancy",
       y = "HIV Prevalence")

  1. What is the relationship between fertility rate and GDP per capita?
#plot the relationship between fertility rate and GDP per capita
ggplot(merge_wbd, aes(x = fertility_rate, y = GDP_cap)) + 
  geom_point(size = 0.2) + 
  geom_smooth() +
  facet_wrap(~region, scales = "free") +
  theme_bw() +
  labs(title = "Relationship Between Fertility Rate and GDP per Capita",
       x = "Fertility Rate",
       y = "GDP per Capita") 

  1. Which regions have the most observations with missing HIV data?
#find the regions with most missing HIV data
library("scales") # package to format data in percentage 
merge_wbd %>% 
  select(country, region, date, hiv_prv) %>% 
  group_by(region) %>% 
  summarize(data_total = NROW(hiv_prv),
            na_total = sum(is.na(hiv_prv)),
            na_pct = scales::percent(round(na_total/data_total,2))) %>%
  ungroup() %>% 
  ggplot(aes(x = na_total, y = reorder(region, na_total))) + 
  geom_col() +
  labs(title = "Europe & Central Asia Has the Most Missing HIV Data", 
       subtitle = "Number of Missing Observations & Percentage of Total Regional Observations",
       x = "Number of Missing HIV Observations",
       y = NULL) +
  theme_bw(base_size = 12) +
  ggrepel::geom_text_repel(aes(label = na_pct), nudge_x = 1, size = 3) +
  NULL
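
The grouped NA count used above can be sanity-checked on a tiny hypothetical data frame (column names mirror merge_wbd; the values are made up):

```r
library(dplyr)
library(scales)

# Hypothetical mini data frame: region A has 1 of 2 values missing,
# region B has 2 of 3 missing
toy <- tibble(
  region  = c("A", "A", "B", "B", "B"),
  hiv_prv = c(1.2, NA,  NA,  NA,  0.8)
)

na_summary <- toy %>%
  group_by(region) %>%
  summarize(data_total = n(),
            na_total   = sum(is.na(hiv_prv)),
            na_pct     = percent(round(na_total / data_total, 2)))
na_summary
```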

  1. How has the under-5 mortality rate changed by region? In each region, which are the top 5 countries with the greatest improvement, and which 5 countries have seen the least improvement or even deterioration?
#Plot the child(under 5) mortality rate trend by regions
merge_wbd %>% 
  select(region, country, mortality_rate, date) %>% 
  drop_na(., mortality_rate) %>% 
  group_by(region, date) %>% 
  summarize(mean_mort = mean(mortality_rate)) %>% 
  ungroup() %>% 
  ggplot(aes(x = date, y = mean_mort)) + 
  geom_line() + 
  facet_wrap(~region, scales = "free") +
  labs(title = "Regional Child Mortality Rate Is Declining (1960 - 2016)",
       x = NULL,
       y = "Mean Mortality Rate") +
  theme_bw() +
  NULL

#Find the 5 most improved countries per region, comparing 2000 and 2016 because many countries lack earlier data
new_merge_wbd <- merge_wbd %>% 
  select(region, country, mortality_rate, date) %>% 
  filter(date %in% c("2000","2016")) %>% 
  pivot_wider(names_from = date, values_from = mortality_rate) %>% 
  drop_na()

names(new_merge_wbd)[3] <- "Y2000" #rename the year columns as they are hard to reference
names(new_merge_wbd)[4] <- "Y2016"

most_improved_mortality_rate <- new_merge_wbd %>% 
  mutate(mort_impr = (Y2016 - Y2000)/Y2000) %>% 
  group_by(region) %>% 
  slice_min(order_by = mort_impr, n = 5) %>% 
  ungroup() %>% 
  ggplot(aes(x = mort_impr, y = reorder(country, mort_impr))) +
  geom_col() +
  facet_wrap(~region, scales = "free") +
  labs(title = "5 Countries with the Most Improvement in Mortality Rate 2000-2016",
       x = "Percentage Change of Mortality Rate",
       y = NULL) +
  theme_bw() +
  NULL
most_improved_mortality_rate

least_improved_mortality_rate <- new_merge_wbd %>% 
  mutate(mort_impr = (Y2016 - Y2000)/Y2000) %>% 
  group_by(region) %>% 
  slice_max(order_by = mort_impr, n = 5) %>% 
  ungroup() %>% 
  ggplot(aes(x = mort_impr, y = reorder(country, mort_impr))) +
  geom_col() +
  facet_wrap(~region, scales = "free") +
  labs(title = "5 Countries with the Least Improvement in Mortality Rate 2000-2016",
       x = "Percentage Change of Mortality Rate",
       y = NULL) +
  theme_bw() +
  NULL

least_improved_mortality_rate

> The under-5 mortality rate shows a decreasing trend across all regions from 1960 to 2016. Because some countries lack data from 1960, we compare 2016 against 2000 to include as many countries as possible when measuring improvement. A negative percentage change means a country has improved (reduced) its mortality rate, whereas a positive percentage change indicates deterioration. By this measure, five countries' mortality rates deteriorated during 2000-2016.
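
The sign convention for the percentage change can be verified with made-up mortality figures:

```r
# Hypothetical under-5 mortality rates (per 1,000 live births)
y2000 <- c(improved = 80, worsened = 20)
y2016 <- c(improved = 40, worsened = 25)

mort_impr <- (y2016 - y2000) / y2000
mort_impr  # improved: -0.50 (rate halved); worsened: +0.25 (deterioration)
```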

  1. Is there a relationship between primary school enrollment and fertility rate?
#plot the relationship between primary school enrollment and fertility rate
merge_wbd %>% 
  select(region, country, school_enroll, fertility_rate) %>%
  drop_na(school_enroll, fertility_rate) %>% 
  ggplot(aes(x = school_enroll, y = fertility_rate)) +
  geom_point(size = 0.5, alpha = 0.3) +
  geom_smooth() +
  facet_wrap(~region, scales = "free") +
  labs(title = "Negative Correlation Between Primary School Enrollment and Fertility Rate",
       x = "Primary School Enrollment Rate",
       y = "Fertility Rate") +
  theme_bw() +
  NULL

> There is a negative correlation between primary school enrollment and fertility rate in all regions.
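
One way to quantify that correlation numerically is `cor()` with `use = "pairwise.complete.obs"`, which tolerates the missing values present in the merged data; a toy check with made-up enrollment and fertility figures:

```r
# Made-up values: enrollment rising while fertility falls, with one NA
enroll    <- c(60, 70, 80, NA, 95)
fertility <- c(6.0, 5.1, 4.0, 3.5, 2.2)

cor(enroll, fertility, use = "pairwise.complete.obs")  # strongly negative
```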